INTRODUCTION

“Novel  Targets  for  the  Diagnosis  and  Treatment  of  Breast  Cancer” 

Chromosomal amplifications contain multiple copies of a genome segment. Regions that are
consistently found to be amplified in breast cancer contain “driver” genes whose increased expression
provides a selective advantage to the cell, and may also provide targets for diagnosis and treatment.
HER-2/neu is one successful example. There is great interest in identifying other such genes, particularly
those that might prove useful for treatment or diagnosis. Among this category of genes, the most
clinically useful are those that are membrane-associated, such as receptors, transmembrane proteins, or
secreted proteins. We proposed to use genomic methods to identify genes within amplicons that are also
membrane-associated.

We  proposed  to  identify  these  genes  using  a  novel  screening  method,  which  takes  advantage  of  the  fact 
that  proteins  which  function  at  the  membrane  surface  are  preferentially  translated  from  membrane- 
associated  ribosomes,  and  their  RNA  will  give  a  differential  signal  on  cDNA  expression  arrays.  We  can 
then  select  the  genes  and  ESTs  from  among  them  that  are  contained  within  amplicons,  providing  we 
have  accurate  sequence  localizations  for  both  the  EST  and  the  amplicon.  We  further  refined  our 
objectives  since  the  submission  of  this  proposal  to  target  our  efforts  toward  pleural  metastatic  disease. 

Malignant  pleural  effusions  represent  a  significant  problem  in  the  management  of  patients  with  breast 
cancer.  Approximately  half  of  patients  with  disseminated  disease  will  develop  pleural  effusions  and 
many  of  these  patients  will  have  respiratory  problems  or  other  symptoms  requiring  intervention  [1]. 
These  patients  tend  to  have  a  poor  prognosis  [1],  although  it  is  not  clear  if  this  is  due  to  the  pleural 
metastases  or  their  advanced  breast  disease.  Although  pleural  disease  occurs  frequently  in  advanced 
stage  disease,  it  is  not  inevitable,  suggesting  that  there  are  tumor-specific  features  which  regulate  the 
ability  to  proliferate  in  the  pleural  space.  Understanding  the  basis  of  this  tissue  tropism  might  provide 
some  ideas  in  identifying  therapeutic  targets  for  controlling  this  growth.  To  identify  factors  which  are 
associated  with  pleural  metastases,  we  compared  cell  lines  derived  from  primary  breast  tumors  to 
pleural metastases, examining DNA copy number and gene expression levels. We further refined our
database  to  select  amplified  and  over-expressed  genes  that  are  membrane-associated. 

Our  specific  aims  are: 

(1)  To  use  genomic  array  analysis  to  identify  and  delineate  amplicons  in  breast  cancer,  comparing 
primary  to  pleural  metastatic  disease. 

(2)  To  identify  genes  within  these  amplicons  that  encode  surface-expressed  molecules,  using  cDNA 
array  hybridization  of  fractionated  polysomes. 

(3)  To  measure  the  expression  levels  of  these  selected  genes,  identifying  those  that  are  the  upregulated 
targets  of  amplification  in  pleural  metastatic  disease. 

We  hope  to  identify  a  set  of  novel  genes  that  are  overexpressed  in  breast  cancer  with  translational 
significance,  as  they  represent  potential  targets  for  antibodies  (surface  antigens),  receptors  for  external 
factors  that  regulate  cell  growth,  or  circulating  tumor  markers  (secreted  peptides).  Identifying  these 
driver  genes  will  provide  new  insights  and  reagents  for  diagnosis  and  treatment  of  breast  cancer. 


BODY 

Task  1.  To  specify  intervals  of  genomic  amplification  in  breast  cancer  cell  lines  or  explants 
using  genomic  microarrays 

Task  2:  To  prepare  a  database  containing  at  most  9,000  to  15,000  genes  expressed  in  the 
MCF7  breast  cancer  cell  line  that  are  likely  to  be  membrane-associated. 

Task  3.  To  select  genes  within  amplicons  and  measure  their  expression  level,  to  identify  those  that  are 
upregulated 

The work comprised two major projects. The first was the preparation of the database of
membrane-associated genes in breast cancer (Task 2). The second involved analysis of pleural metastases
vs. primary breast cancer to identify amplicons and highly expressed genes, and to correlate these with
the membrane database (Tasks 1 and 3).

I. Preparation  of  the  database  of  surface-expressed  molecules. 

This  was  reported  previously,  and  has  been  published  in  Cancer  Research. 

Stitziel NO, Mar BG, Liang J, Westbrook CA. Membrane-Associated and Secreted Genes in Breast
Cancer. Cancer Res. 2004;64:8682-8687.

A  copy  of  the  manuscript  is  included  in  the  appendix. 


II.  Amplicons  and  gene  expression  in  pleural  metastatic  disease. 

Materials  and  Methods 

All  cell  lines  had  been  developed  from  patient  specimens  of  infiltrating  ductal  carcinoma  collected  at  the 
University  of  Illinois  at  Chicago  Cancer  Center  as  previously  described  [2].  The  tissue  was  taken  from  a 
resected  breast  primary  (for  the  primary  cell  line  group)  or  pleural  fluid  in  patients  with  pleural 
metastases  (for  the  metastatic  cell  line  group).  Each  cell  line  was  cultured  following  standard  cell 
culture protocols, and was expanded to 1 x 10^7 cells before harvesting.

RNA  extraction  and  profiling  on  Affymetrix  arrays 

Total RNA was extracted from each sample using TRIzol reagent (Invitrogen, Carlsbad,
CA) following the manufacturer’s recommended protocol. RNA was quantified by UV
spectrophotometry and the quality was checked by ensuring the ratio of absorbance at 260/280 nm was >1.8
as well as verifying the presence of ribosomal bands via gel electrophoresis. 20 µg of total RNA for each
cell  line  was  biotin  labeled  in  two  separate  and  equal  reactions  according  to  the  recommended 
Affymetrix  protocol.  Each  reaction  (2  separate  labeling  reactions  per  cell  line)  was  hybridized  to 
Affymetrix  U133A  oligonucleotide  microarrays  according  to  the  manufacturer’s  recommended  protocol. 
The  microarrays  were  washed  and  scanned  according  to  standard  protocol,  and  the  resulting  CEL  files 
were  processed  as  follows. 


Gene  expression  calculation  and  processing 


The  CEL  files  for  each  of  the  microarray  hybridizations  (14  total  =  3  metastatic  cell  lines 
and  4  primary  cell  lines  all  hybridized  in  duplicate)  were  processed  using  the  Bioconductor  software 
suite  (a  set  of  R  [3]  libraries  available  at  http://www.bioconductor.org).  The  Robust  Multiarray  Average 
(RMA)  algorithm  from  Bioconductor  [4-6]  was  used  for  normalization,  background  calculation  and 
subtraction,  and  expression  value  calculation. 

DNA  copy  number  analysis 

DNA was extracted from 1 x 10^7 cells previously flash frozen in liquid nitrogen using the Puregene DNA
isolation kit (Gentra, Minneapolis, MN). DNA amplification analysis was performed using the
GenoSensor Array 300 system (Vysis, Downers Grove, IL) as follows. 1 µg each of sample DNA and
reference DNA (female DNA obtained from Stratagene, La Jolla, CA) was labeled with Cy-3 and Cy-5
fluorophores, respectively. These labeled products were hybridized to the GenoSensor Array 300 for 72
hours according to the manufacturer’s specifications. Following the recommended washing procedure,
the slides were scanned using the GenoSensor Reader System. Using the manufacturer’s software, the
relative DNA amplification for each probe was calculated along with a p-value indicating the statistical
significance. A p-value of < 0.01 was considered significant.

Statistical  calculations 


To  determine  which  genes  were  likely  to  be  associated  with  the  pleural  metastatic  phenotype,  we  applied 
statistical  and  differential  expression  filtering  methods  to  exclude  genes  from  the  total  expression  group 
that were not likely to correlate with the phenotype or were not expressed strongly enough. First, t-tests
with equal variance were performed for each microarray element, calculating the probability of
differential expression between the primary and metastatic group. The probability was calculated
directly from the t distribution with 12 (6 metastatic + 8 primary - 2) degrees of freedom. We used the
Benjamini and Hochberg [7] method of controlling the False Discovery Rate (FDR) for multiple
hypothesis testing correction. The adjusted threshold α was set at 0.05. To ensure each gene was
expressed at a high enough level, we filtered out genes with less than 2-fold differential expression
between the primary and metastatic groups. To determine the top correlated genes, Pearson’s
correlation was calculated for each gene with a vector of idealized phenotypes (i.e. [1,1,1,...,-1,-1,-1]).

For clustering, complete linkage was used with (1 - Pearson’s correlation) as a distance metric.
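
The filtering steps described in this section can be illustrated with a minimal Python sketch. It is not the analysis code used in this study, which relied on R and Bioconductor. The function names and toy inputs are hypothetical, and the p-values passed to the Benjamini-Hochberg step are assumed to have already been obtained from the t distribution.

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Equal-variance (pooled) two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2  # e.g. 6 metastatic + 8 primary - 2 = 12
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up rule: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


def passes_fold_change(log2_met: Sequence[float], log2_pri: Sequence[float],
                       min_fold: float = 2.0) -> bool:
    """Fold-change filter on log2 expression values (2-fold equals 1 log2 unit)."""
    return abs(mean(log2_met) - mean(log2_pri)) >= math.log2(min_fold)
```

A probe set would be retained only if it is both rejected by `benjamini_hochberg` at α = 0.05 and passes `passes_fold_change`.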

Results 


Gene  expression  signature  of  pleural  metastases 

To  investigate  the  gene  expression  signature  of  pleural  metastases,  we  elected  to  use  cell  lines  that  had 
been  created  from  either  pleural  metastases  or  primary  breast  tumors.  Seven  cell  lines  were  available  for 
our  analysis;  three  derived  from  pleural  metastases  and  four  derived  from  primary  tumors.  We 
hybridized  RNA  from  these  cell  lines  to  Affymetrix  U133A  microarrays  and  calculated  expression 
values  for  22,283  genetic  elements.  The  global  gene  expression  profiles  from  the  two  groups  were 
compared,  and  after  applying  the  statistical  and  expression  filtering  as  described  above,  107  genes 
(represented  by  121  probe  sets)  were  found  to  show  differential  expression  between  the  two  groups.  We 
define  this  as  the  gene  expression  signature  of  pleural  metastases.  Of  these  107  genes,  65  were 
upregulated  in  pleural  metastatic  disease,  and  42  genes  were  downregulated.  Strong  intra-group 
clustering was found using these 121 probe sets (Figure 1). Correlation with the metastatic phenotype
was calculated for each gene, and the top 18 genes were selected for further analysis (these are listed in
Table 1). A plot of the expression levels of the top correlated genes is seen in Figure 2.
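
The clustering referred to here (complete linkage with 1 - Pearson’s correlation as the distance) can be sketched in plain Python. This is an illustrative re-implementation with hypothetical function names and toy profiles, not the software used to produce Figure 1, and it stops at a requested number of clusters rather than building a full dendrogram.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def complete_linkage(profiles: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering; cluster distance is the largest pairwise
    (1 - Pearson correlation) between members of the two clusters."""
    clusters = [[i] for i in range(len(profiles))]

    def dist(i: int, j: int) -> float:
        return 1.0 - pearson(profiles[i], profiles[j])

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy example: four hybridizations described by four probe sets each.
toy_profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6.5, 4, 2]]
toy_groups = complete_linkage(toy_profiles, n_clusters=2)
```

In this toy case the first two profiles rise together and the last two fall together, so they separate into two groups.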


A  unique  metastatic  profile 

To  investigate  the  possibility  that  common  metastatic  genes  were  driving  these  pleural  metastases,  we 
compared  our  pleural  metastatic  profile  with  two  recently  reported  signatures.  A  70-gene  expression 
signature  has  been  reported  to  be  highly  correlated  with  metastasis  and  poor-prognosis  [8].  We 
compared  our  pleural  metastatic  profile  with  the  70-gene  expression  signature  to  determine  if  any  of  the 
poor-prognosis genes were present in our profile. Of the 70 genes, we found 52 to be represented by
elements  on  the  Affymetrix  U133A  GeneChip.  Of  these  52,  only  two  were  found  in  our  pleural 
metastatic  profile:  maternal  embryonic  leucine  zipper  kinase  (MELK)  and  nucleolar  and  spindle 
associated  protein  1  (NUSAP1).  Interestingly,  while  both  were  upregulated  in  the  poor-prognosis 
profile,  they  were  both  found  downregulated  in  pleural  metastatic  disease. 

We  also  compared  our  pleural  metastatic  profile  with  a  recently  reported  102-gene  profile  specific  for 
breast  cancer  metastases  in  bone  [9].  Given  that  the  102-gene  profile  was  performed  on  Affymetrix 
U133A  GeneChips,  we  were  able  to  directly  compare  the  results.  Only  one  gene,  tumor  necrosis  factor 
alpha-induced protein 2 (TNFAIP2), was found in common between the two datasets. While
downregulated  2.2  fold  in  bone  metastases,  it  was  found  to  be  2.1  fold  upregulated  in  pleural  metastases. 

Genes  highly  correlated  with  pleural  metastases 

To  select  those  genes  whose  expression  is  highly  correlated  with  the  pleural  metastatic  phenotype,  we 
calculated Pearson’s correlation with an idealized phenotypic vector (i.e. [1,1,1,...,-1,-1,-1]) for each
gene  as  described  above.  Genes  with  a  correlation  coefficient  above  an  arbitrarily  chosen  cutoff  of  0.85 
were  selected  for  further  analysis.  Those  18  genes  are  listed  in  Table  1,  and  a  clustering  based  on  their 
expression  levels  is  shown  in  Figure  2. 
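
The selection of genes by correlation with an idealized phenotype vector can be sketched as follows. This is a hypothetical illustration: the gene names and expression values are invented, the coding of metastatic hybridizations as 1 and primary hybridizations as -1 is an assumption, and the sketch uses the signed correlation because the text does not state whether absolute values were used.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def top_correlated(expression: Dict[str, Sequence[float]],
                   phenotype: Sequence[float],
                   cutoff: float = 0.85) -> List[str]:
    """Names of genes whose correlation with the phenotype exceeds the cutoff."""
    return [gene for gene, values in expression.items()
            if pearson(values, phenotype) > cutoff]


# Assumed layout: 6 metastatic hybridizations followed by 8 primary ones.
phenotype = [1] * 6 + [-1] * 8

# Invented log2 expression values for two hypothetical genes.
toy_expression = {
    "GENE_UP": [9.0, 9.1, 8.9, 9.0, 9.2, 9.0,
                7.0, 7.1, 6.9, 7.0, 7.0, 7.1, 6.9, 7.0],
    "GENE_FLAT": [8, 7, 8, 7, 8, 7,
                  8, 7, 8, 7, 8, 7, 8, 7],
}
selected = top_correlated(toy_expression, phenotype)
```

Here only the hypothetical GENE_UP tracks the phenotype vector closely enough to pass the 0.85 cutoff.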

To  investigate  the  potential  role  of  the  genes  that  were  most  highly  correlated  with  the  pleural  metastatic 
phenotype, we correlated each Affymetrix probe set with one or more Gene Ontology [10] Biological
Process  (BP)  annotations.  One  third  of  the  genes  (6  of  18)  have  no  BP  annotation,  while  another  third 
appear  to  be  involved  in  metabolism  (4  of  18)  or  cell  growth  and  maintenance  (2  of  18).  The  remaining 
probe  sets  with  BP  annotation  are  split  between  transcription  and  translation,  angiogenesis,  and  protein 
phosphorylation. 

A  potential  role  of  gene  amplification 

Increased  DNA  copy  number  has  recently  been  shown  to  play  a  driving  role  in  increased  levels  of  gene 
expression [11, 12]. To explore the possibility that gene upregulation in the pleural signature might be
associated  with  genomic  amplification,  DNA  copy  number  analysis  was  performed  using  competitive 
DNA  hybridization  on  glass  slides.  The  Vysis  chip  contains  287  targets  spotted  in  triplicate  which 
represent  genetic  segments  that  are  frequent  sites  of  genomic  amplification  and  genomic  loss. 
Fluorescently  labeled  DNA  from  each  of  the  seven  cell  lines  was  hybridized  to  the  chip  in  competition 
with  fluorescently  labeled  normal  female  DNA  and  scored  for  sites  in  which  there  was  a  statistically 
significant  gain  or  loss. 


The  results  of  this  study  are  shown  in  Figure  3,  which  displays  both  amplified  and  deleted  chromosomal 
regions,  comparing  pleural  to  primary  cell  lines.  While  certain  trends  are  evident,  after  applying  the 
Benjamini-Hochberg method of correcting for multiple hypothesis testing, there were no statistically
significant  amplifications  that  distinguished  the  metastatic  group  from  the  primary  group.  Despite  this, 
several  of  the  genes  in  our  pleural  metastatic  signature  fall  in  regions  found  to  be  amplified  in  one  or 
more  metastatic  lines  at  a  statistically  significant  level  for  that  cell  line,  as  summarized  in  Table  2.  Of 
note,  four  genes  in  the  pleural  metastatic  signature  (two  of  which  are  included  in  the  top  correlated 
genes) are found in 10p11-15, a region which appears to be highly amplified in one of the metastatic
lines (BCA-1) and moderately amplified in another metastatic line (BCA-4). 10p11-15 shows no
amplification  in  the  primary  lines. 


KEY  RESEARCH  ACCOMPLISHMENTS 

>  Preparation  of  database  of  membrane-associated  and  secreted  genes  in  breast  cancer 

>  Describing  a  gene  expression  signature  that  is  highly  correlated  with  metastatic  disease 

>  Describing  a  set  of  genes  that  is  amplified  in  pleural  metastatic  disease 
CONCLUSIONS


The  purpose  of  this  study  was  to  identify  potential  targets  for  the  diagnosis  and  treatment  of  pleural 
metastatic  breast  cancer.  We  chose  to  use  cell  lines  derived  from  patient  tumors  in  order  to  have  a 
model  in  which  our  findings  can  be  tested.  Furthermore,  these  are  early-passage  cultures,  and  as  such 
are  thought  to  be  genetically  similar  to  the  tissue  from  which  they  were  derived,  with  little  cell-culture 
artifact. 

Although  the  sample  size  was  small,  several  interesting  observations  can  be  made.  First,  we  identified  a 
genetic  signature  that  is  associated  with  pleural  metastatic  disease.  There  is  some  evidence  that  this 
program  may  be  unique  to  pleural  metastases  as  it  is  independent  from  a  signature  shown  to  predict 
aggressive metastatic behavior and poor prognosis [8], as well as a signature of bony metastasis using a
murine explant system [9]. The genes we identified as a genetic signature of pleural metastasis might be
mediators of pleural tropism in breast cancer. Interestingly, of the 18 genes most highly correlated with
pleural metastasis, one third (6 of 18, or half of the 12 with a BP annotation) are involved with cellular
metabolism or growth and maintenance, and might
point to the biochemical modifications necessary for specialized growth of cells in the pleural cavity.
Such pathways might be good targets for inhibitory drugs that control tumor growth in the pleural
space rather than acting systemically.

The  role  of  DNA  amplification  was  examined  to  investigate  the  possibility  that  this  mechanism  is 
involved  with  the  gene  expression  changes  observed  in  the  pleural  metastatic  cell  lines.  While  there 
were  no  statistically  significant  changes  observed  between  the  metastatic  and  primary  groups,  there  was 
an amplified region present in two out of three pleural metastatic cell lines that contained two of the top
18 genes most correlated with the metastatic phenotype and an additional two from the larger gene
expression  signature.  While  preliminary,  this  warrants  further  study  to  determine  if  this  amplicon  is 
driving  this  gene  expression  change  in  the  metastatic  lines. 

The  pleural  space  represents  a  good  site  for  drug  delivery  and  patients  would  benefit  if  their  malignant 
pleural disease was controlled. In order to develop new therapies for the treatment of pleural effusions
from breast cancer, one needs to determine whether the pleural disease is similar to or distinct from primary
breast tumors. While this study is limited by the number of cell lines available for analysis, it supports the
theory  that  pleural  metastatic  disease  represents  a  genetically  distinct  form  of  breast  cancer  with  highly 
correlated  genes  suitable  for  further  analysis  as  therapeutic  targets. 
Table 1  Genes most correlated with the pleural metastatic phenotype.


Table  2  Genes  from  the  pleural  metastatic  profile  found  in  amplified  regions  in 
pleural  metastatic  cell  lines.  DNA  copy  number  is  listed  for  each  cell  line  for  the  given 
genes  from  the  pleural  metastatic  profile.  Amplifications  considered  statistically 
significant  are  shaded  gray.  (Specific  DNA  copy  number  for  DNAJC1  is  unavailable  due 
to  lack  of  genomic  probe  at  that  location,  however  it  is  within  the  amplified  region  in  the 
pleural  metastatic  cell  lines.) 


Figure 1. Hierarchical clustering diagram. The expression of our 121 probe set profile
was used to cluster the 14 hybridizations (7 cell lines in duplicate). (1 - Pearson correlation) was used as a distance
metric  and  complete  linkage  was  used  for  joining  nodes.  The  data  is  displayed  such  that 
each  row  represents  one  probe  set  and  each  column  represents  one  cell  line  hybridization. 
Sample  names  are  below  each  column;  “a”  and  “b”  refer  to  duplicate  hybridizations.  The 
black  and  yellow  bars  represent  pleural  metastatic  and  primary  cell  lines,  respectively. 
Above  the  columns  and  to  the  left  of  the  rows  are  the  hierarchical  dendrograms  for  their 
respective  dimension. 


Figure  2.  Expression  of  genes  most  correlated  with  pleural  metastatic  phenotype. 

Figure  3.  Percentage  of  cell  lines  sharing  a  relative  DNA  copy  gain  or  loss  for 
primary  and  pleural  metastatic  groups.  Statistically  significant  gains  and  losses  are 
plotted  in  green  and  red,  respectively.  Probes  are  plotted  according  to  their  genomic 
position  on  each  chromosome.  Centromeres  are  represented  by  dashed  lines. 
ABSTRACT

The  identification  of  membrane-associated  and  secreted  genes  that  are 
differentially  expressed  is  a  useful  step  in  defining  new  targets  for  the 
diagnosis and treatment of cancer. Extracting information on the subcellular
localization of genes represented on DNA microarrays is difficult
and  is  limited  by  the  incomplete  sequence  and  annotation  that  is  available 
in  existing  databases.  Here  we  combine  a  biochemical  and  bioinformatic 
approach  to  identify  membrane-associated  and  secreted  genes  expressed 
in  the  MCF-7  breast  cancer  cell  line.  Our  approach  is  based  on  the 
analysis of differential hybridization levels of RNAs that have been physically
separated by virtue of their association with polysomes on the
endoplasmic reticulum. This approach is specifically applicable to oligonucleotide
microarrays such as Affymetrix, which use single-color hybridization
instead of dual-color competitive hybridizations. Assignment to
membrane-associated  and  secreted  class  membership  is  based  on  both  the 
differential  hybridization  levels  and  an  expression  threshold,  which  are 
calculated  empirically  from  data  collected  on  a  reference  set  of  known 
cytoplasmic and membrane proteins. This method enabled the identification
of 755 membrane-associated and secreted probe sets expressed in
MCF-7  cells  for  which  this  annotation  did  not  previously  exist.  The  data 
were  used  to  filter  a  previously  reported  expression  dataset  to  identify 
membrane-associated  and  secreted  genes  which  are  associated  with  poor 
prognosis  in  breast  cancer  and  represent  potential  targets  for  diagnosis 
and  treatment.  The  approach  reported  here  should  provide  a  useful  tool 
for the analysis of gene expression patterns, identifying membrane-associated
or secreted genes with biological relevance that have the potential
for clinical applications in diagnosis or treatment.

INTRODUCTION 

With  the  advent  of  high-throughput  global  genomic  strategies,  the 
potential  exists  for  the  identification  of  many  novel  genes  that  have  a 
specific  association  with  cancer,  and  the  gene  product  of  which  has 
diagnostic  or  therapeutic  implications.  The  task  remains  to  determine 
which of these have the most immediate potential for clinical translation.
Among the most useful proteins in the clinical setting are those
that are associated with the cancer cell membrane, including those
that are membrane-bound and those that are secreted extracellularly
(referred to here as membrane-associated and secreted genes).
Membrane-bound proteins include
surface antigen targets for diagnosis or treatment, receptors for external
factors that regulate cell growth, and proteins that regulate cell
adhesion  and  metastases.  Secreted  proteins  and  peptides  can  be  used 
as  circulating  tumor  markers  for  diagnosis  and  monitoring. 

The  characterization  of  a  novel  gene  as  one  that  encodes  a 
membrane-associated or secreted protein can be difficult. Although
computational  methods  exist  for  predicting  whether  a  protein  is 
membrane-bound  or  secreted  (1,  2),  these  methods  cannot  be  applied 
to  incomplete  or  poorly  annotated  gene  sequence  and  are  inexact  even 
in  the  best  setting. 

An alternative method to identify membrane-associated and secreted
genes experimentally based on differential hybridization to
glass  slide  cDNA  arrays  was  recently  shown  (3).  This  method  takes 
advantage  of  the  fact  that  proteins  that  function  at  the  membrane 
surface  or  are  immediately  secreted  are  preferentially  translated  from 
ribosomes  at  the  endoplasmic  reticulum  to  which  they  are  directed  by 
their  signal  peptide.  Because  their  association  with  the  endoplasmic 
reticulum  membrane  makes  them  less  dense,  these  membrane-bound 
polysomes  can  be  separated  from  their  heavier  cytosolic  counterparts 
by  sucrose  gradient  centrifugation  (4).  RNA  prepared  from  these  two 
cellular  subfractions  is  used  for  differential  cDNA  hybridization  to 
identify  those  that  are  most  highly  associated  with  the  membrane- 
bound  polysomes. 

Here  we  report  the  application  of  this  method  to  the  global  analysis 
of  the  genes  expressed  in  a  breast  cancer  cell  line,  MCF-7,  modifying 
it  for  the  widely  used  Affymetrix  chips.  We  develop  a  statistical 
approach  to  determine  the  membrane  association  of  each  Affymetrix 
probe  set,  as  expressed  in  the  cell  line,  by  comparing  the  ratio  on  two 
chips  to  a  reference  set  of  known  cytoplasmic  and  membrane  proteins. 
The  results  of  this  study  were  then  used  to  analyze  the  data  from  a 
previously  reported  differential  expression  study  in  breast  cancer  (5), 
to identify membrane-associated and secreted genes that are associated
with poor prognosis, demonstrating the utility of this approach to
identify  potential  targets  for  diagnosis  and  treatment  from  differential 
hybridization  studies. 

MATERIALS  AND  METHODS 
Cell  Line  Preparation 

MCF-7  cells  were  purchased  from  the  American  Type  Culture  Collection 
(Manassas, VA) and were cultured in Eagle’s MEM supplemented with 0.01
mg/mL bovine insulin and 10% fetal bovine serum in 5% CO2 at 37°C.

Polysome  Fractionation 

Polysomes  were  fractionated  by  sucrose  density  gradient  centrifugation  with 
a  modification  of  the  method  described  by  Mechler  (4).  After  treatment  with 
cycloheximide (10 µg/mL) for 10 minutes at 37°C, 3 X 10^8 MCF-7 cells in
log growth were collected by scraping the dishes into cold PBS. The cells were
then resuspended at a concentration of 2.5 X 10^8 cells/mL in a hypotonic lysis
buffer [10 mmol/L KCl, 1.5 mmol/L MgCl2, and 10 mmol/L Tris-Cl (pH 7.4)]
and were allowed to rest on ice for 10 minutes. After lysing cells with a
Dounce homogenizer, nuclear and cell debris were removed by centrifugation
at 2,000 X g (4°C) for 2.5 minutes. The supernatant was loaded on a
discontinuous-step sucrose gradient (2.5 mol/L, 2.1 mol/L, 1.95 mol/L, and 1.3
mol/L sucrose) and centrifuged at 26,000 X g for 5 hours. After centrifugation,
successive 1.5 mL fractions were collected from the bottom of the centrifugation
tube, and the A260nm was measured to estimate the RNA content.

RNA  Preparation 

Total  RNA  was  isolated  from  the  pooled  sucrose  gradient  fractions  by 
mixing  with  TRIzol  LS  reagent  (Invitrogen,  Carlsbad,  CA)  at  a  3:1  ratio 
(3 parts TRIzol to 1 part sucrose), and extraction was done according to
manufacturer’s  instructions,  followed  by  two  additional  salt  precipitations  [0.3 
mol/L  sodium  acetate  (pH  5.2)  with  3  volumes  of  EtOH]. 

Real-time  Quantitative  Reverse-transcriptase  PCR 

First-Strand cDNA Synthesis. For generation of first-strand cDNA, ~1
µg of RNA was reverse-transcribed with Superscript II Reverse Transcriptase
Kit (Invitrogen) in the presence of oligodeoxythymidylic acid (12-18) in a
final 20-µL reaction volume with reverse transcriptase per manufacturer’s
recommended protocol followed by RNase H treatment.

Real-time  Reverse  Transcriptase  PCR  Setup.  Real-time  PCR  reactions 
were  done  with  DNase-free  cDNA  templates  generated  above  and  SYBR 
Green  PCR  Core  Reagents  (Applied  Biosystems,  Branchburg,  NJ)  following 
manufacturer’s protocol with the following modifications: a 25 µL reaction
was used, which contained 1X SYBR Green PCR mix, 3 mmol/L MgCl2, 1X
deoxynucleoside triphosphate blend (0.2 mmol/L of dATP, dCTP, dGTP, and
0.4 mmol/L of dUTP), 0.625 units of AmpliTaq Gold, 0.125 units of AmpErase
UNG, 50 nmol/L each of forward and reverse primer, and 1 µL of cDNA
template.  Default  PCR  amplification  cycles  were  used  as  specified  by  the  ABI 
Prism  7700  Sequence  Detection  System  (Applied  Biosystems):  50°C  for  2 
minutes, 95°C for 10 minutes, 40 cycles of 15 seconds at 95°C, and 60°C for
1  minute.  PCR  amplification  was  followed  by  melting  curve  analysis  with  the 
following  3  hold  cycles:  95°C  for  15  seconds,  60°C  for  20  seconds,  and  95°C 
for  15  seconds,  with  the  ramping  time  at  maximum  value  19:59  minutes  set  at 
the  last  hold  cycle.  PCR  amplification  analysis  was  done  on  Sequence  Detector 
v.1.7a, and melting curves were analyzed on Dissociation Curves v.1 according
to Applied Biosystems guidelines.

MCF-7  cDNA  template  was  used  to  generate  a  relative  standard  curve  for 
either  the  endogenous  control  or  the  target  gene.  MCF-7  cDNA  was  serially 
diluted  at  1:10  dilution  factors  starting  with  the  highest  arbitrary  concentration 
of 50 ng/µL (taken from one twentieth of a reverse transcriptase reaction of 1
µg of starting total RNA). The sample templates were diluted at 1:5. All
samples were done in triplicate. The sequences of the primers used in real-time
reverse transcriptase-PCR are as follows: endogenous control 18S rRNA:
F: 5'-GTAACCCGTTGAACCCCATT-3', R: 5'-CCATCCAATCGGTAGTAGCG-3'
(6) with the expected size of 150 bp; and junctional adhesion molecule
(JAM1): F: 5'-CCCTCTTGGCTTGATTTTGC-3', R: 5'-TGACCTTGACTGATGGCTTC-3'
with the expected size of 115 bp. The glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) primers were obtained from Applied
Biosystems  (Foster  City,  CA).  The  quantity  of  target  RNA  (either  JAM1  or 
GAPDH)  was  calculated  by  interpolating  from  the  standard  curve  generated 
for  that  specific  target.  Both  JAM1  and  GAPDH  quantities  were  normalized  to 
18S  rRNA  to  calculate  the  relative  quantity. 
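
The relative standard curve calculation described above can be sketched in Python. This is a hedged illustration only: the function names and the threshold cycle (Ct) values are hypothetical, and the linear relation between Ct and log10 of input quantity is the usual assumption for a relative standard curve rather than a detail stated in the text. The dilution series (1:10 steps starting at 50 ng/µL) follows the description above.

```python
import math
from typing import Sequence, Tuple


def fit_standard_curve(quantities: Sequence[float],
                       cts: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of Ct = intercept + slope * log10(quantity)."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mx, my = sum(xs) / n, sum(cts) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, cts))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


def interpolate_quantity(ct: float, slope: float, intercept: float) -> float:
    """Quantity corresponding to a sample Ct, read off the standard curve."""
    return 10 ** ((ct - intercept) / slope)


def relative_quantity(target_qty: float, reference_qty: float) -> float:
    """Target quantity normalized to the endogenous control (18S rRNA here)."""
    return target_qty / reference_qty


# 1:10 serial dilutions starting at 50 ng/uL, with invented Ct values.
standards = [50.0, 5.0, 0.5, 0.05]
standard_cts = [20.0, 23.3, 26.6, 29.9]
slope, intercept = fit_standard_curve(standards, standard_cts)
```

A separate curve would be fitted for each target (JAM1, GAPDH, and 18S rRNA), and each sample quantity would then be divided by the 18S rRNA quantity from the same sample.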

Microarray  Hybridization  and  Data  Processing 

To  minimize  the  effects  of  technical  variability,  membrane-associated  and 
secreted  and  cytoplasmic  RNA  pools  from  one  fractionation  were  processed  in 
triplicate as follows. In three parallel reactions, 10 μg of total RNA from each
pool  was  labeled,  hybridized  to  the  Affymetrix  U133A  microarray,  processed, 
and  scanned  according  to  standard  Affymetrix  protocols.  The  six  resulting 
CEL  files  (three  membrane-associated  and  secreted  and  three  cytoplasmic) 
were  processed  with  the  Bioconductor  software  suite  (a  set  of  libraries  for  R; 
ref. 7). The robust multiarray average algorithm (8-10) was used for normalization,
background correction, and expression value calculation. Membrane-associated
and secreted/cytoplasmic ratios for each microarray element were
calculated by taking the ratios of the average membrane-associated and secreted
and cytoplasmic expression values for that microarray element. Log2
transformed  data  (as  returned  by  the  robust  multiarray  average  algorithm)  were 
used  for  the  ratio  calculation. 
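
As a minimal sketch of the ratio calculation, assuming the three replicate values per pool are simply averaged on the log2 scale before dividing (the function name `ms_cyt_ratio` and the numbers are hypothetical):

```python
from statistics import mean

def ms_cyt_ratio(ms_log2, cyt_log2):
    """Ratio of mean log2 expression: MS pool over cytoplasmic pool."""
    return mean(ms_log2) / mean(cyt_log2)

# Hypothetical log2 values for one probe set across the three replicate arrays.
ms_values = [11.9, 12.1, 12.0]
cyt_values = [10.1, 9.9, 10.0]
ratio = ms_cyt_ratio(ms_values, cyt_values)
```

Because the ratio is formed from log2 values rather than from linear intensities, ratios cluster close to 1, which is the scale on which the thresholds reported later in the text are expressed.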

Membrane  and  Cytoplasmic  Gene  Reference  Set 

A  reference  set  was  developed  containing  genes  that  are  known  to  have 
either membrane or cytoplasmic location and are represented on the Affymetrix
U133A microarray with an automatic database search based on
Swiss-Prot (11) release 44. Each Affymetrix microarray element has a unique
Affymetrix Probe ID that can be mapped to at least one Swiss-Prot accession
number (footnote 4). Each Swiss-Prot entry was then searched for “cellular location”
comment  tags.  Proteins  were  considered  to  have  membrane-associated  and 
secreted  localization  if  the  Swiss-Prot  cellular  location  tag  contained  one  of  the 
following identifiers: “secreted,” “Golgi,” “vesicular,” “membrane,” “lysosome,”
or “peroxisome.” Entries were considered tentative if they contained
“probable,” “possible,” “potential,” or “by similarity” and considered unambiguous
if not. Entries that contained “nuclear,” “nucleus,” or “mitochondrial”
were removed as there is some evidence that nuclear and mitochondrial
proteins  can  be  synthesized  in  either  pathway  (12,  13).  This  resulting  list  was 
then  hand  edited  to  remove  entries  containing  multiple  isoforms  targeted  for 
different subcellular compartments. Proteins were considered to have cytoplasmic
localization if the Swiss-Prot cellular location tag contained “cytoplasmic”
or  “cytoplasm.”  Again,  entries  with  probable,  possible,  potential,  and  by 
similarity  were  considered  tentative,  and  entries  containing  nuclear,  nucleus, 
and  mitochondrial  were  removed.  This  list  was  hand  edited  to  remove  any 
entries  with  multiple  isoforms  as  well  as  entries  that  contained  any  references 
to  membrane  association  or  organelles. 
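
The automatic part of these annotation rules can be sketched as a keyword filter. This is an illustrative approximation only: the function name `classify_location` is hypothetical, the matching is plain substring search on the comment text, and the hand-editing steps (isoforms, organelle references) are not reproduced beyond dropping entries that match both classes.

```python
MS_TERMS = ("secreted", "golgi", "vesicular", "membrane", "lysosome", "peroxisome")
CYT_TERMS = ("cytoplasmic", "cytoplasm")
EXCLUDE_TERMS = ("nuclear", "nucleus", "mitochondrial")
TENTATIVE_TERMS = ("probable", "possible", "potential", "by similarity")

def classify_location(comment):
    """Return (class, is_tentative) for a location comment, or None if excluded."""
    text = comment.lower()
    if any(term in text for term in EXCLUDE_TERMS):
        return None  # nuclear/mitochondrial entries are removed
    is_ms = any(term in text for term in MS_TERMS)
    is_cyt = any(term in text for term in CYT_TERMS)
    if is_ms == is_cyt:
        return None  # no usable annotation, or conflicting annotation
    tentative = any(term in text for term in TENTATIVE_TERMS)
    return ("MS" if is_ms else "CYT", tentative)
```

Only entries returned with `is_tentative` equal to `False` would enter the reference set; tentative entries are kept aside for the later comparison with predictions.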

Statistical  Calculations 

At a given membrane-associated and secreted/cytoplasmic expression ratio
r, the probability of belonging to the membrane-associated and secreted class
for probe sets with ratios above r, p(m | R > r), is calculated by using Bayes’
rule as shown in Equation 1.

\[
p(m \mid R > r) = \frac{p(R > r \mid m)\,P_m}{p(R > r \mid m)\,P_m + p(R > r \mid c)\,P_c} \tag{1}
\]

where p(R > r | a) is the proportion of class a above ratio r. We calculate this
probability for the entire range of membrane-associated and secreted/cytoplasmic
ratios at intervals of 0.01 and choose the ratio that corresponds to the
maximum posterior probability (we choose the lowest ratio for which the posterior
probability rises to within 10% of the maximum probability). The P_a factor
corresponds to the prior probability of belonging to class a.
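
A minimal sketch of this threshold search follows. It assumes that "within 10% of the maximum" means a posterior of at least 0.9 times the maximum over the grid, and that the grid spans the observed ratios; the function names and the toy data are hypothetical.

```python
def tail_fraction(ratios, r):
    """Proportion of a class with ratio strictly above r, i.e. p(R > r | class)."""
    return sum(1 for x in ratios if x > r) / len(ratios)

def posterior_ms(r, ms_ratios, cyt_ratios, prior_ms):
    """Bayes' rule for p(m | R > r); returns None when no probe set exceeds r."""
    numerator = tail_fraction(ms_ratios, r) * prior_ms
    denominator = numerator + tail_fraction(cyt_ratios, r) * (1.0 - prior_ms)
    return numerator / denominator if denominator > 0 else None

def choose_threshold(ms_ratios, cyt_ratios, prior_ms, step=0.01, tolerance=0.10):
    """Lowest grid ratio whose posterior is within `tolerance` of the maximum."""
    lo = min(min(ms_ratios), min(cyt_ratios))
    hi = max(max(ms_ratios), max(cyt_ratios))
    n_steps = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 10) for i in range(n_steps + 1)]
    scored = [(r, posterior_ms(r, ms_ratios, cyt_ratios, prior_ms)) for r in grid]
    scored = [(r, p) for r, p in scored if p is not None]
    best = max(p for _, p in scored)
    return min(r for r, p in scored if p >= (1.0 - tolerance) * best)

# Toy reference set: three MS and three cytoplasmic ratios, equal priors.
toy_ms = [1.2, 1.3, 1.4]
toy_cyt = [0.9, 1.0, 1.105]
threshold = choose_threshold(toy_ms, toy_cyt, prior_ms=0.5)
```

Taking the lowest qualifying ratio rather than the exact maximizer keeps more probe sets above the cutoff while giving up little posterior probability.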

Because  of  a  lack  of  previous  data  on  which  to  base  our  prior  probabilities, 
we  estimate  these  prior  probabilities  by  determining  the  contributions  of  the 
two  known  distributions  to  the  distribution  of  probe  sets  with  unknown 
localization  as  follows.  We  assume  the  distribution  of  membrane-associated 
and  secreted/cytoplasmic  ratios  for  the  unknown  set  will  be  approximated  by 
a  linear  combination  of  the  membrane-associated  and  secreted/cytoplasmic 
ratio  distributions  for  the  known  membrane-associated  and  secreted  and  known 
cytoplasmic  distributions,  as  shown  in  Equation  2. 

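\[
f_{\mathrm{unknown}} = \alpha f_{\mathrm{MS}} + \beta f_{\mathrm{CYT}} \tag{2}
\]

To estimate the contributions of the two known distributions, we find α and β
that minimize the sum of the squared errors between these two quantities as
shown in Equation 3.

\[
\min_{\alpha,\beta} \sum \left( f_{\mathrm{unknown}} - \left( \alpha f_{\mathrm{MS}} + \beta f_{\mathrm{CYT}} \right) \right)^2 \tag{3}
\]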

For  this  calculation,  we  use  discretized  data  (bins  of  width  0.01)  and  scale  the 
original  membrane-associated  and  secreted  and  cytoplasmic  distributions  to  a 
maximum  of  one. 
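
The least-squares fit of the two mixing weights has a closed form through the 2 × 2 normal equations, sketched below on binned histograms. The function names are hypothetical, and converting the weights into priors by normalizing them to sum to one is an assumption; the text does not spell out that step.

```python
def fit_mixture(f_unknown, f_ms, f_cyt):
    """Solve min over (alpha, beta) of sum (f_unknown - alpha*f_ms - beta*f_cyt)^2."""
    s_mm = sum(m * m for m in f_ms)
    s_cc = sum(c * c for c in f_cyt)
    s_mc = sum(m * c for m, c in zip(f_ms, f_cyt))
    s_um = sum(u * m for u, m in zip(f_unknown, f_ms))
    s_uc = sum(u * c for u, c in zip(f_unknown, f_cyt))
    det = s_mm * s_cc - s_mc ** 2  # zero only if the two histograms are proportional
    alpha = (s_um * s_cc - s_uc * s_mc) / det
    beta = (s_uc * s_mm - s_um * s_mc) / det
    return alpha, beta

def priors_from_weights(alpha, beta):
    """Assumed conversion: normalize the fitted weights so they sum to one."""
    total = alpha + beta
    return alpha / total, beta / total

# Toy histograms (each scaled to a maximum of one) over four bins.
toy_f_ms = [0.0, 0.2, 1.0, 0.3]
toy_f_cyt = [1.0, 0.4, 0.1, 0.0]
toy_f_unknown = [0.2 * m + 0.8 * c for m, c in zip(toy_f_ms, toy_f_cyt)]
alpha, beta = fit_mixture(toy_f_unknown, toy_f_ms, toy_f_cyt)
prior_ms, prior_cyt = priors_from_weights(alpha, beta)
```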

Sensitivity is defined as TP/(TP+FN), specificity is defined as TN/
(TN+FP), and positive predictive value is defined as TP/(TP+FP), where TP
(true  positives)  are  the  number  of  membrane-associated  and  secreted  genes  that 
are  labeled  correctly,  FN  (false  negatives)  are  the  number  of  membrane- 
associated  and  secreted  genes  that  are  labeled  incorrectly,  TN  (true  negatives) 
are  the  number  of  cytoplasmic  genes  that  are  labeled  correctly,  and  FP  (false 
positives)  are  the  number  of  cytoplasmic  genes  that  are  incorrectly  labeled. 
Sensitivity  is  a  measure  of  the  portion  of  membrane-associated  and  secreted 
genes  we  can  detect,  specificity  is  a  measure  of  the  portion  of  cytoplasmic 
genes  we  can  detect,  and  positive  predictive  value  is  a  measure  of  how  many 
predicted  membrane-associated  and  secreted  genes  are  truly  membrane-bound 
or  secreted. 
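
For reference, the three measures can be computed from the four counts as in the short sketch below, which uses the standard definitions that match the verbal descriptions in the paragraph above (the function name and the counts are hypothetical).

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and positive predictive value from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true MS genes that are detected
        "specificity": tn / (tn + fp),  # share of cytoplasmic genes labeled correctly
        "ppv": tp / (tp + fp),          # share of predicted MS genes that are truly MS
    }

# Made-up counts for illustration.
metrics = classification_metrics(tp=80, fp=5, tn=95, fn=20)
```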


4  Web  address:  www.affymetrix.com/analysis/. 


RESULTS 

RNA Fractionation and Verification. The fractionation of polysomes
by sucrose density gradient centrifugation was first described
by Mechler (4), and it was based on the observation that genes
encoding proteins that are membrane-associated or secreted are translated
by ribosomes bound to the endoplasmic reticulum
(membrane-bound polysomes), whereas genes encoding proteins that are cytosolic
are  translated  by  ribosomes  free  in  the  cytosol  (free  polysomes). 
Membrane-bound  polysomes,  being  less  dense,  rise  to  the  top  of  the 
gradient,  whereas  free  polysomes  remain  near  the  bottom  of  the 
gradient. 

Here,  the  method  was  used  to  separate  intact  polysomes  prepared 
from  the  MCF-7  breast  cancer  cell  line.  In  a  typical  fractionation,  the 
separation  results  in  two  distinct  peaks  of  A260  nm,  with  the  lower  peak 
representing  free  polysomes  and  the  upper  peak  containing  the  less 
dense  membrane-bound  polysomes.  To  prepare  sufficient  RNA  for 
Affymetrix  hybridizations,  it  was  necessary  to  fractionate  polysomes 
from  3  X  10s  cells;  the  results  of  this  procedure  are  shown  in  Fig.  1. 
Fractions  from  each  peak  were  pooled,  and  the  fractions  in  the  peak 
nearest the bottom (2 to 15) of the gradient are designated cytoplasmic,
whereas fractions in the peak nearest the top of the gradient (20
through 26) are designated membrane-associated and secreted. Sucrose fractions
with near-baseline absorption (16 through 19) in between the cytoplasmic
and membrane-associated and secreted pools were saved as
negative  controls.  The  peak  at  the  surface  of  the  gradient  (the  top  1.5 
mL  fraction)  was  discarded. 

To confirm that the membrane-associated and secreted and cytoplasmic
pools were enriched for membrane-associated and secreted mRNA
and cytoplasmic mRNA, respectively, real-time
quantitative reverse-transcriptase PCR was done with two primer
pairs expected to amplify coding sequences specific for each population.
We reverse transcribed 1 μg of total RNA each from the
membrane-associated and secreted and cytoplasmic pools and labeled
the resulting cDNA in three separate reactions for each pool. JAM1 is
primarily cell-surface associated, whereas GAPDH is a protein found
free in the cytoplasm. Because of their different biological sequestering,
we expected JAM1 to be more highly represented in the membrane-bound
polysome RNA, whereas the opposite would be true for
GAPDH. To confirm the physical separation of these two RNAs, the
membrane-associated  and  secreted/cytoplasmic  expression  ratio  was 
calculated  (see  Table  1)  by  taking  the  ratios  of  the  averages  from  the 
triplicates.  As  seen  in  Table  1,  the  membrane-associated  and  secreted/ 


Fig.  1.  RNA  content  of  fractions  taken  from  the  sucrose  gradient.  Vertical  axis  shows 
the  A260  nm,  and  the  horizontal  axis  gives  the  fraction  number  from  the  bottom  of  the 
gradient. 


Table 1 Difference in RNA expression ratios for a membrane-associated and cytoplasmic gene

Target | MS/CYT ratio, as measured by reverse transcriptase-PCR | MS/CYT ratio, as measured by Affymetrix U133A microarray
GAPDH | 0.00064 | 0.986
JAM1 | 0.387 | 1.173

Abbreviation: MS/CYT, membrane-associated and secreted/cytoplasmic.


cytoplasmic  expression  ratio  is  about  1,000-fold  greater  for  JAM1 
than for GAPDH, demonstrating a marked enrichment in the membrane-associated
and secreted pool for JAM1.

The  RNA  pools  were  then  labeled  and  hybridized  to  Affymetrix 
U133A  microarrays.  Membrane-associated  and  secreted  expression 
and  cytoplasmic  expression  are  the  values  returned  by  the  robust 
multiarray average calculation of expression measured on the Affymetrix
array hybridized to membrane-associated and secreted and
cytoplasmic RNA, respectively, and were calculated for each microarray
element by averaging the expression value across the appropriate
triplicate (supplementary data). The membrane-associated and secreted/
cytoplasmic ratio for GAPDH (AFFX-HUMGAPDH/M33197_M_at)
and JAM1 (221664_s_at) was then calculated, as shown in Table 1.
As expected, the membrane-associated and secreted pool shows an
enrichment for JAM1 as compared with GAPDH, whereas the cytoplasmic
pool shows an enrichment for GAPDH. All microarray data
are  available  at  the  Gene  Expression  Omnibus  (14)  as  accession 
number  GSE1400. 

Reference Set Construction. Because the distribution of membrane-associated
and secreted/cytoplasmic ratios for either class is not
known a priori, it was necessary to train a classifier with a reference
set of genes with known subcellular localization. Of all 22,283 elements
on the Affymetrix U133A array, subcellular location annotation,
as described in Materials and Methods, was available for 9,851
elements. Unambiguous membrane-associated and secreted annotation
was found for 3,188 of these, whereas unambiguous cytoplasmic
annotation was found for 798 elements. These elements with unambiguous
annotation represent the reference set.

Expression  Threshold  Calculation.  It  is  likely  that  only  a  subset 
of  the  elements  on  the  U133A  microarray  will  be  expressed  in  MCF-7 
cells at a level great enough for meaningful measurement. To determine
that level, we evaluated our ability to distinguish known membrane-associated
and secreted genes from known cytosolic genes in
the reference set at various total expression (ET) levels, where ET =
membrane-associated and secreted expression + cytoplasmic expression,
where membrane-associated and secreted and cytoplasmic expression
are the average exponentiated expression values for the
membrane-associated and secreted and cytoplasmic microarrays, respectively.
A 10-fold cross validation was done at increasing threshold
levels of ET, including only training set members with an ET
value ≥ threshold. Briefly, for each ET level, the data were randomly
partitioned into 10 groups, 9 of which were used as a “training” set,
and the remaining group was designated as a “testing” set. At each ET
level, the membrane-associated and secreted/cytoplasmic ratio threshold
was calculated (as described in Materials and Methods) for the
training  set  at  that  ET  level.  The  positive  predictive  value,  sensitivity, 
and  specificity  were  calculated  by  examining  the  performance  of 
predicting  the  testing  set  for  that  ET  level,  and  averages  over  the  10 
groups  were  recorded.  The  results  of  these  calculations  are  shown  in 
Fig.  2.  As  shown  in  Fig.  2A,  the  performance  of  prediction  for  ET 
thresholds  ranging  from  22,283  (100%  of  the  microarray  elements)  to 
1,106  (4.9%  of  the  microarray  elements)  was  examined.  The  ET  level 
that  corresponded  to  the  highest  sensitivity  without  a  significant  drop 
in  positive  predictive  value  or  specificity  was  738.  At  this  level  only 
24.6% of probe sets with the highest ET are included, resulting in a
Fig. 2. Identifying the optimal ET threshold for predicting membrane-associated and
secreted genes. A. The number of probes with total expression level ET above specific
threshold values of ET. B. The percentage of correctly labeled membrane-associated and
secreted and cytoplasmic genes at differing ET thresholds. C. The number of known
membrane-associated and secreted genes that are correctly predicted at differing ET
thresholds. D. The number of known cytoplasmic genes that are correctly predicted at
differing ET thresholds. Averages of the 10-fold cross validation results are plotted in
B-D.


final  dataset  of  5,483  probe  sets  that  pass  this  threshold  filtering.  At 
this ET level, our membrane-associated and secreted prior probability
estimate is 7.2%, with a corresponding cytoplasmic prior probability
of 92.8%. Of those probe sets above this ET, 538 have unambiguous
membrane-associated and secreted annotation and 305 have unambiguous
cytoplasmic annotation. Additionally, at this level, our 10-fold
cross-validation  yields  a  97.5%  positive  predictive  value  with  80.7% 
sensitivity  and  96.9%  specificity. 
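
The cross-validation loop described above can be sketched as follows. This is a simplified illustration: `midpoint_threshold` is a hypothetical stand-in for the Bayesian ratio-threshold selection, the fold assignment uses a fixed random seed for reproducibility, and a probe set is called membrane-associated and secreted when its ratio exceeds the threshold learned on the other nine groups.

```python
import random

def make_folds(n_items, n_folds=10, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint groups."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def midpoint_threshold(training):
    """Hypothetical stand-in for the Bayesian threshold selection."""
    ms = [r for r, label in training if label == "MS"]
    cyt = [r for r, label in training if label == "CYT"]
    return (sum(ms) / len(ms) + sum(cyt) / len(cyt)) / 2.0

def cross_validate(ratios, labels, choose_threshold, n_folds=10, seed=0):
    """Return per-fold (TP, FP, TN, FN) counts measured on the held-out group."""
    counts = []
    for test_idx in make_folds(len(ratios), n_folds, seed):
        held_out = set(test_idx)
        training = [(ratios[i], labels[i])
                    for i in range(len(ratios)) if i not in held_out]
        threshold = choose_threshold(training)
        tp = fp = tn = fn = 0
        for i in test_idx:
            predicted_ms = ratios[i] > threshold
            if labels[i] == "MS":
                tp, fn = tp + predicted_ms, fn + (not predicted_ms)
            else:
                fp, tn = fp + predicted_ms, tn + (not predicted_ms)
        counts.append((tp, fp, tn, fn))
    return counts

# Toy data: 20 MS probe sets at ratio 1.2 and 20 cytoplasmic at ratio 1.0.
toy_ratios = [1.2] * 20 + [1.0] * 20
toy_labels = ["MS"] * 20 + ["CYT"] * 20
fold_counts = cross_validate(toy_ratios, toy_labels, midpoint_threshold)
```

Per-fold sensitivity, specificity, and positive predictive value would then be computed from each tuple and averaged over the 10 groups.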

Membrane-Associated and Secreted/Cytoplasmic Ratio Threshold
Calculation. All of the 843 probe sets in the reference set
(with  an  ET  above  the  threshold  of  723)  were  used  to  determine  the 
membrane-associated  and  secreted/cytoplasmic  ratio  that  corresponds  to 
the  maximum  posterior  probability  of  belonging  to  the  membrane- 
associated  and  secreted  class.  The  distribution  of  membrane-associated 
and  secreted/cytoplasmic  ratios  for  genes  with  known  localizations  was 
examined  (Fig.  3).  It  is  interesting  to  note  that  the  cytoplasmic  genes 
show  a  discrete  peak,  whereas  the  membrane-associated  and  secreted 
genes  show  a  bimodal  distribution  with  a  smaller  peak  that  associates 
with  the  cytoplasmic  genes.  The  membrane-associated  and  secreted/ 
cytoplasmic ratio of 1.08 was calculated as giving the maximum posterior
probability.  Note  that  above  this  level,  the  majority  of  known  cytoplasmic 
genes  are  excluded  (only  3.2%  are  above  this  level),  and  a  sizeable 
fraction  of  the  known  membrane-associated  and  secreted  genes  (22%) 
show  a  lower  membrane-associated  and  secreted/cytoplasmic  ratio.  Thus, 
genes  with  a  ratio  below  1.08  cannot  be  designated  with  certainty  as 
either  membrane-associated  and  secreted  or  cytoplasmic. 

The  distribution  of  membrane-associated  and  secreted/cytoplasmic 
ratios for the remaining probe sets (genes of unknown cellular localization)
is plotted in Fig. 4. Of these, 755 probe sets fall above the
expression threshold and above the membrane-associated and secreted/cytoplasmic
ratio of 1.08. These 755 probe sets are labeled as
“predicted membrane-associated and secreted.” The remaining 3,885
probe sets found above the expression threshold and below the
membrane-associated and secreted/cytoplasmic ratio of 1.08 are labeled
as “indeterminate,” because we expect a mixture of cytoplasmic


and  membrane-associated  and  secreted  genes  in  this  range  of 
membrane-associated and secreted/cytoplasmic ratios. Of the predicted
membrane-associated and secreted probe sets, 323 were found
to have a tentative subcellular annotation but did not meet the criteria
previously established for the reference set. The remaining 432 probe
sets have no subcellular annotation. A similar percentage of indeterminate
probe sets were found to have some tentative subcellular
annotation  (1516  of  3885). 

The  Swiss-Prot  annotations  were  searched  for  terms  that  might 


Fig.  3.  Distribution  of  membrane-associated  and  secreted/cytoplasmic  ratios  for  all  of 
the  genes  in  the  reference  set  expressing  above  the  ET.  The  midpoints  of  bins  from 
frequency  histograms  are  plotted  (for  visual  clarity,  bins  are  0.05  units  wide).  The  vertical 
line indicates a membrane-associated and secreted/cytoplasmic ratio of 1.08. The distribution
of membrane-associated and secreted genes is plotted with dashed lines, whereas
the solid line indicates the distribution of cytoplasmic genes. (MS/CYT,
membrane-associated and secreted/cytoplasmic)




Fig.  4.  Distribution  of  membrane-associated  and  secreted/cytoplasmic  ratios  for  genes 
that  are  not  in  the  reference  set  expressing  above  the  ET.  The  midpoints  of  bins  from 
frequency  histograms  are  plotted  (for  visual  clarity,  bins  are  0.05  units  wide).  The  vertical 
line indicates a membrane-associated and secreted/cytoplasmic ratio of 1.08. (MS/CYT,
membrane-associated and secreted/cytoplasmic)


Table 2 Tentative subcellular annotation for probe sets with predicted localization

Predicted location | Total probe sets | Probe sets with tentative subcellular annotation | Tentative MS annotation | Tentative cytoplasmic annotation | Tentative nuclear or mitochondrial annotation | Conflicting annotation | Other
MS | 755 | 323 | 214 (69.8%) | 6 (1.4%) | 56 (15.9%) | 15 (3.6%) | 32 (9.2%)
Indeterminate | 3885 | 1516 | 113 (7.5%) | 189 (12.5%) | 961 (63.4%) | 219 (14.4%) | 34 (2.2%)

Abbreviation: MS, membrane-associated and secreted.


indicate  a  tentative  assignment  to  a  cellular  fraction  (e.g.,  “membrane 
by similarity” or “nuclear”). Table 2 summarizes the tentative localization
annotations for the predicted membrane-associated and secreted
and indeterminate groups. Seventy percent (214 of 323) of the
predicted membrane-associated and secreted probe sets with tentative
annotations indicate a membrane-associated and secreted subcellular
location. The binomial probability (with the prior probability of
membrane-associated and secreted class membership as calculated in
Statistical Calculations) of obtaining this number of membrane-associated
and secreted probe sets by chance is very low (P << 0.005),
indicating that the probe set population with membrane-associated
and secreted/cytoplasmic ratios ≥1.08 is significantly enriched
for membrane-associated and secreted genes. Less than 2% (6
of 323) of the predicted membrane-associated and secreted probe sets
with tentative annotation are thought to be cytoplasmic. The remaining
probe sets have either conflicting annotation or are thought to be
localized  to  the  nucleus,  the  endoplasmic  reticulum,  mitochondria,  or 
other  intracellular  locations.  Biochemical  process  annotation  was 
available  for  224  of  these  323  probe  sets  in  Gene  Ontology  (15).  Over 
half  of  these  seem  to  be  involved  in  metabolism,  whereas  one  third  are 
involved  in  cell  growth.  Almost  25%  of  the  predicted  membrane- 
associated  and  secreted  class  are  involved  in  cell  communication. 
(Although  these  annotations  seem  to  comprise  a  greater  number  than 
the actual number of annotated probe sets, Gene Ontology is organized
in a way such that multiple annotations can correspond to a
single  probe  set.) 
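
The binomial probability quoted in this paragraph can be checked with a short calculation. The sketch below assumes the quantity of interest is the upper tail, P(X ≥ 214) for X ~ Binomial(323, 0.072), with 0.072 taken from the prior probability estimate reported earlier; the function name `binomial_tail` is hypothetical.

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)  # vanishingly small terms underflow to 0.0
    return total

# 214 of 323 tentatively annotated probe sets, prior MS probability 7.2%.
p_value = binomial_tail(214, 323, 0.072)
```

Working in log space avoids overflow in the binomial coefficients; under these assumptions the resulting tail probability is many orders of magnitude below 0.005.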

In  contrast,  7.5%  (113  of  1516)  of  the  indeterminate  probe  sets  with 
tentative  annotation  indicate  a  membrane-associated  and  secreted 
localization.  Although  only  12.5%  (189  of  1516)  of  these  are  thought 
to be cytoplasmic, a significant fraction of the probe sets with conflicting
annotation indicate a possible cytoplasmic localization. Interestingly,
>60% of the indeterminate probe sets contain nuclear or
mitochondrial  annotation. 

Analysis  of  a  Gene  Expression  Study  for  Membrane-Associated 
and  Secreted  Gene  Content.  The  MCF-7  membrane-associated  and 
secreted  gene  dataset  was  used  to  filter  a  differential  gene  expression 
study  in  breast  cancer,  which  compared  tumors  with  good  versus  poor 
5-year  outcome  (5).  We  asked  whether  the  membrane-associated  and 
secreted  localization  provided  by  our  study  might  give  additional 
insight  into  the  interpretation  of  the  results  and  facilitate  the  selection 
of  target  genes  for  additional  evaluation. 

In  the  van’t  Veer  et  al.  (5)  study,  RNA  from  98  primary  breast 
tumors was hybridized to cDNA microarrays, and the resultant analysis
led to a 231-gene expression profile associated with poor prognosis.
The original study was performed on cDNA glass slide microarrays;
therefore, we needed to find which elements of the
Affymetrix  U133A  microarray  corresponded  to  the  231  genes  from 
the  original  study.  It  was  possible  to  map  166  of  these  231  genes  to 
269  probe  sets  on  the  Affymetrix  microarray.  Of  these  269  probe  sets, 
20  were  found  in  our  predicted  membrane-associated  and  secreted 
database  representing  15  unique  genes  (see  Table  3);  an  additional  52 
were  found  in  our  training  set  of  previously  known  membrane- 
associated  and  secreted  genes.  Of  the  genes  not  in  the  training  set, 
almost  half  (7  of  15)  had  no  subcellular  location  annotation  in  Gene 
Ontology or Swiss-Prot, although one had a published characterization.
Of the 9 genes with functional annotation, 5 are involved in
metabolism,  along  with  one  each  involved  in  signal  transduction, 
cell-cycle  regulation,  proteolysis,  and  calcium  binding.  It  is  interesting 
to  note  that  of  the  genes  without  functional  annotation,  HCCR1  is  a 
putative  proto-oncogene,  fucosyltransferase  8  is  thought  to  contribute 
to  malignancy,  “G  protein-coupled  receptor  126”  contains  a  “protein 
tyrosine  phosphatase-like  protein”  domain,  and  “hypothetical  protein 
FLJ22341” contains a rhomboid domain, thought to regulate epidermal
growth factor receptor expression. Any of these proteins, the
up-regulation of which is associated with poor prognosis in breast
cancer, merits additional investigation as a potential treatment target.

DISCUSSION 

We  describe  here  a  novel  set  of  membrane-associated  and  secreted 
genes  expressed  in  MCF-7  cells.  We  are  able  to  annotate  755  probe 
sets as membrane-associated or secreted, 432 of which had no previous
subcellular location annotation. Two levels of validation
strengthen our location predictions. First, we did 10-fold cross validation
on the set of genes with annotated localization, which is a
robust  method  for  estimating  performance  on  future  datasets  with 
similar  characteristics.  On  the  basis  of  the  results  of  the  10-fold  cross 
validation,  it  is  likely  that  a  great  number  of  the  predicted  membrane- 
associated  and  secreted  genes  will  have  membrane-associated  and 
secreted  localization.  This  is  reflected  by  the  average  97%  positive 
predictive  value  observed  in  the  10-fold  cross  validation.  Second,  we 
examined  the  tentative  annotations  of  genes  in  the  set  that  were  not 
used in the cross validation test and for which we predicted subcellular
localization. Many of these have some tentative annotation,
which we do not consider definitive. Nevertheless, our membrane-associated
and secreted predictions coincide with these tentative annotations
70% of the time.

Here  we  describe  a  general  method  of  applying  density  gradient 
fractionation  of  RNA  to  the  Affymetrix  platform,  including  a  robust 
statistical  analysis.  Furthermore,  we  have  described  an  approach  that 
can  easily  be  modified  for  other  tissues  or  states  for  comparative 
studies. 

To  minimize  technical  variability,  hybridization  data  were  collected 
in  triplicate,  with  three  independent  labeling  experiments  on  RNA 
collected  from  one  fractionation  experiment.  It  was  not  possible  to 
compare  the  results  obtained  from  multiple  fractionations  of  different 
cell  cultures  because  of  the  prohibitive  cost  of  processing  these  large 
volumes  of  cells  and  of  Affymetrix  hybridization.  Thus,  the  results 
shown  here  represent  a  “snapshot”  of  a  cell  line  at  a  single  point  in 
time;  it  is  possible  that  the  representation  of  some  genes,  and  even 
their  membrane-associated  and  secreted/cytoplasmic  distribution,  will 
vary  with  different  culture  conditions.  Indeed,  this  approach  might  be 
used to investigate global changes in subcellular distribution of proteins
under various biological conditions, which to our knowledge has
not  been  addressed  previously. 

Our  Bayesian  analysis  may  be  over-  or  underestimating  membrane- 
associated  and  secreted  localization  because  of  some  violations  of  the 
equation assumptions. The localizations of different genes are not
entirely  independent  observations.  For  instance,  there  are  clearly 
genes  that  colocalize  because  of  genetic  interactions.  In  addition,  we 


Table 3 Predicted membrane-associated and secreted genes in a breast cancer expression dataset (see text for details)

Affymetrix ID | Original accession no. | Gene name | Description | Localization annotation (GO and Swiss-Prot) | MS/CYT ratio
212640_at | AF052159 | | Homo sapiens clone 24416 mRNA sequence | None | 1.294
212248_at | AK000745 | | Homo sapiens cDNA FLJ20738 fis, clone HEP08257 | None | 1.261
212250_at | AK000745 | | Homo sapiens cDNA FLJ20738 fis, clone HEP08257 | None | 1.232
212251_at | AK000745 | | Homo sapiens cDNA FLJ20738 fis, clone HEP08257 | None | 1.217
201818_at | AF052162 | FLJ12443 | Hypothetical protein FLJ12443 | None | 1.205
218686_s_at | Contig55188_RC | FLJ22341 | Hypothetical protein FLJ22341 | None | 1.116
219202_at | Contig55188_RC | FLJ22341 | Hypothetical protein FLJ22341 | None | 1.133
207170_s_at | NM_015416 | HCCR1 | Cervical cancer 1 proto-oncogene | None | 1.080
201037_at | D25328 | PFKP | Phosphofructokinase, platelet | None | 1.115
219197_s_at | NM_020974 | CEGP1 | CEGP1 protein | Not annotated, but literature suggests secreted protein | 1.327
208658_at | NM_004911 | ERP70 | Protein disulfide isomerase related protein (calcium-binding protein, intestinal-related) | Endoplasmic reticulum | 1.221
211048_s_at | NM_004911 | ERP70 | Protein disulfide isomerase related protein (calcium-binding protein, intestinal-related) | Endoplasmic reticulum | 1.263
210074_at | NM_001333 | CTSL2 | Cathepsin L2 | Lysosome | 1.310
212290_at | AL050021 | | Homo sapiens mRNA; cDNA DKFZp564D016 (from clone DKFZp564D016) | Membrane protein | 1.212
212295_s_at | AL050021 | | Homo sapiens mRNA; cDNA DKFZp564D016 (from clone DKFZp564D016) | Membrane protein | 1.223
213094_at | AL080079 | DKFZP564D0462 | Hypothetical protein DKFZp564D0462 | Membrane protein | 1.345
219410_at | NM_018004 | FLJ10134 | Hypothetical protein FLJ10134 | Membrane protein | 1.210
221675_s_at | NM_020244 | LOC56994 | Cholinephosphotransferase 1 | Membrane protein | 1.356
203988_s_at | NM_004480 | FUT8 | Fucosyltransferase 8 (alpha (1,6) fucosyltransferase) | Membrane protein (by similarity) | 1.206
203362_s_at | NM_002358 | MAD2L1 | MAD2 (mitotic arrest deficient, yeast, homolog)-like 1 | Nucleus | 1.112

Abbreviations: GO, Gene Ontology; MS/CYT, membrane-associated and secreted/cytoplasmic.


make  the  assumption  that  these  two  classes  are  mutually  exclusive, 
which  may  not  be  true  for  a  small  fraction  of  genes.  The  robust 
multiarray average algorithm might be a different source of underestimation
for membrane-associated and secreted prediction, as it uses
quantile normalization and might be overcorrecting for underrepresented
membrane-associated and secreted genes. It is possible that
alternative  microarray  processing  algorithms  may  yield  additional 
predicted  membrane-associated  and  secreted  genes.  Despite  these 
drawbacks,  we  believe  this  will  be  a  useful  tool  for  investigators 
wishing  to  filter  existing  or  future  breast  cancer  Affymetrix  datasets  to 
look for membrane-associated and secreted genes. Alternative statistical
methods may be useful for additional analysis and confirmation
of  our  results. 

There  are  a  significant  number  of  genes  with  unambiguous 
membrane-associated  and  secreted  annotation  that  fall  below  our 
membrane-associated  and  secreted/cytoplasmic  threshold.  It  is  unclear 
if this is because of a real biological process (for instance, some of those
genes may not be localized to the membrane or secreted in MCF-7 cells)
or a processing
artifact.  Additional  experimental  analysis  is  needed  to  elucidate  the 
mechanism  in  action.  Additional  study  is  also  needed  to  determine 
whether  the  protein  localization  we  discovered  for  MCF-7  cells  holds 
true when analyzing other breast cancer cells.